Discretization of Numerical Attributes: Preprocessing for Machine Learning

Author

  • Knut Magne Risvik
Abstract

The area of knowledge discovery and data mining is growing rapidly, and a large number of methods are employed to mine knowledge. Several of these methods rely on discrete data. However, most datasets used in real applications have attributes with continuous values. To make data mining techniques useful for such datasets, discretization is performed as a preprocessing step. In this paper we examine a few common methods for discretization and test these algorithms on common datasets. We also propose a method for reducing the number of intervals resulting from an orthogonal discretization by compromising the consistency level of a decision system. The algorithms have been evaluated using a rough set toolkit for data analysis.

In Chapter 2, we introduce knowledge discovery. We also discuss the preprocessing step in the knowledge discovery pipeline, and introduce discretization in particular. Chapter 3 introduces some basic notions in rough set theory. In Chapter 4, we further discuss the discretization process and investigate some common methods for discretization. In Chapter 5, we propose a two-step approach to discretization, using the Naive discretization algorithm introduced in Section 4.2 and a proposed algorithm to merge intervals. Empirical results comparing the algorithms discussed in Chapter 4 with the proposed method from Chapter 5 can be found in Chapter 6. In Chapter 7, further work in the area of discretization is discussed. Appendix A contains some notes on the Rosetta[15] framework, and on the implementation of the investigated algorithms in particular.
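As a rough illustration of discretization as a preprocessing step, the sketch below places cut points for a single numerical attribute. It assumes one common reading of a "naive" scheme: a cut goes midway between two adjacent distinct values whenever the sets of decision classes observed at those values differ. The paper's own Section 4.2 is not reproduced on this page, so the details may differ, and the names `naive_cuts` and `discretize` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def naive_cuts(values, labels):
    """Return sorted cut points for one numerical attribute.

    Assumption (not taken from the paper): a cut is placed at the midpoint
    of two adjacent distinct values when their decision-class sets differ.
    """
    # Collect the set of decision classes seen at each distinct value.
    classes_at = defaultdict(set)
    for v, y in zip(values, labels):
        classes_at[v].add(y)

    distinct = sorted(classes_at)
    cuts = []
    # Compare each distinct value with its successor.
    for lo, hi in zip(distinct, distinct[1:]):
        if classes_at[lo] != classes_at[hi]:
            cuts.append((lo + hi) / 2)
    return cuts


def discretize(value, cuts):
    """Map a numerical value to the index of the interval it falls in."""
    return bisect_right(cuts, value)


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    labels = ["a", "a", "b", "b", "a"]
    cuts = naive_cuts(values, labels)
    print(cuts)
    print([discretize(v, cuts) for v in values])
```

A scheme like this tends to produce many intervals on noisy data, which is the motivation the abstract gives for a second step that merges intervals.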


Similar resources

Empirical comparisons of various discretization procedures

The genuine symbolic machine learning (ML) algorithms are capable of processing symbolic, categorial data only. However, real-world problems, e.g. in medicine or finance, involve both symbolic and numerical attributes. Therefore, there is an important issue of ML to discretize (categorize) numerical attributes. There exist quite a few discretization procedures in the ML field. This paper describes ...


Discretization and Grouping: Preprocessing Steps for Data Mining

Unlike on-line discretization performed by a number of machine learning (ML) algorithms for building decision trees or decision rules, we propose off-line algorithms for discretizing numerical attributes and grouping values of nominal attributes. The number of resulting intervals obtained by discretization depends only on the data; the number of groups corresponds to the number of classes. Sinc...


Global discretization of continuous attributes as preprocessing for machine learning

Real-life data usually are presented in databases by real numbers. On the other hand, most inductive learning methods require a small number of attribute values. Thus it is necessary to convert input data sets with continuous attributes into input data sets with discrete attributes. Methods of discretization restricted to single continuous attributes will be called local, while methods that sim...


Optimized Preprocessing for Accurate and Efficient Bioassay Prediction with Machine Learning Algorithms

Bioassay is the measurement of the potency of a chemical substance by its effect on a living animal or plant tissue. Bioassay data and chemical structures from pharmacokinetic and drug metabolism screening are mined from and housed in multiple databases. Bioassay prediction is calculated accordingly to determine further advancement. This paper proposes a four-step preprocessing of datasets for ...


Discretization of Continuous Attributes in Supervised Learning algorithms

We propose a new algorithm, called CILA, for discretization of continuous attributes. The CILA algorithm can be used with any class-labeled data. Tests performed using the CILA algorithm show that it generates discretization schemes that almost always have the highest dependence between the class labels and the discrete intervals, and always have a significantly lower number of intervals, when comp...


Publication date: 2007